{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re\n",
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "from IPython.display import display              # 美化输出数组\n",
    "InteractiveShell.ast_node_interactivity = 'all'  # 打印所有单行变量"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 正则表达式\n",
    "正则表达式(regular expression)主要功能是从字符串(string)中通过特定的模式(pattern)，搜索想要找到的内容。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### re.compile(string[, flag]) \n",
    "pattern = re.compile(string[, flag]) 返回pattern对象"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "re.compile(r'python', re.UNICODE)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.compile('python')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### match\n",
    "re.match(pattern, string[, flags=0])\n",
    "\n",
    "从头开始检查字符串是否符合正则表达式。必须从字符串的第一个字符开始就相符。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 传入pattern对象"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 6), match='python'>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern = re.compile('python')\n",
    "re.match(pattern, 'python markdown')  # -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 直接传入字符串"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 6), match='python'>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.match('python', 'python markdown')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "re.match('markdown', 'python markdown')  # --> None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### search\n",
    "re.search(pattern, string[, flags=0])\n",
    "\n",
    "搜索整个字符串，直到发现符合的子字符串。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(5, 15), match='aidu.baefu'>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r = re.search(r'(a.*)(fu)', 'www.baidu.baefu.com')  # 字符串前面的`r` 是指原始字符串，不转译反斜杠`\\`\n",
    "r"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "15"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r.start()  # --> 匹配到的字符起始位置\n",
    "r.end()    # --> 匹配到的字符串结尾位置"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'aidu.baefu'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "('aidu.bae', 'fu')"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'aidu.bae'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'fu'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r.group()   # --> 缺省为0，返回匹配的字符串\n",
    "r.groups()  # --> 以元组形式返回全部分组截获的字符串\n",
    "r.group(1)  # --> 组中第一个字符串\n",
    "r.group(2)  # --> 组中第二个"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(5, 15)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r.span() # --> 匹配的字符串起始到结尾"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### sub\n",
    "re.sub(pattern, replacement, string[, count=0])\n",
    "\n",
    "在string中利用正则变换pattern进行搜索，对于搜索到的字符串，用另一字符串replacement替换。`返回替换后的字符串`。  \n",
    "参数：  \n",
    "pattern: 正则中的模式字符串。  \n",
    "repl:    替换的字符串，也可为一个函数。  \n",
    "string:  要被查找替换的原始字符串。  \n",
    "count:   模式匹配后替换的最大次数，默认 0 表示替换所有的匹配。  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'aab333bcc333'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "name = 'aab12bcc11'\n",
    "re.sub(r'\\d+', '333', name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### split\n",
    "\n",
    "re.split(pattern, string[, maxsplit])\n",
    "\n",
    "按照能够匹配的子串将string分割后返回列表。maxsplit用于指定最大分割次数，不指定将全部分割。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['one', 'two', 'three', 'four', '']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.split(r'\\d+','one1two22three34four4')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### findall\n",
    "\n",
    "re.findall(pattern, string[, flags])\n",
    "\n",
    "根据正则表达式搜索字符串，将所有符合的子字符串放在一给表(list)中返回"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['1', '22', '34', '4']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(r'\\d+','one1two22three34four4')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## flags\n",
    "正则表达式修饰符(可选标志)  \n",
    "多个标志可以通过按位 OR(|) 它们来指定。如 re.I | re.M 被设置成 I 和 M 标志\n",
    "\n",
    "| 修饰符 | 描述|\n",
    "| :---: | :---: |\n",
    "|re.I(IGNORECASE) | 忽略大小写（括号内是完整写法，下同）\n",
    "|re.M(MULTILINE) | 多行模式，改变 '^' 和 '$' 的行为\n",
    "|re.S(DOTALL) | 点任意匹配模式，改变 '.' 的行为\n",
    "|re.L(LOCALE) | 使预定字符类 \\w \\W \\b \\B \\s \\S 取决于当前区域设定\n",
    "|re.U(UNICODE) | 使预定字符类 \\w \\W \\b \\B \\s \\S \\d \\D 取决于unicode定义的字符属性\n",
    "|re.X(VERBOSE) | 详细模式。这个模式下正则表达式可以是多行，忽略空白字符，并可以加入注释。\n",
    "|re.A使预定字符类 | \\w \\W \\b \\B \\s \\S \\d \\D 取决于ascii定义的字符属性"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 正则表达式模式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 单个字符\n",
    "\n",
    "| 标题 | 标题 |\n",
    "| :---: | :---: |\n",
    "| .     | 匹配除 `\\n` 之外的任何单个字符  \n",
    "| [.\\n] | 匹配所有单个字符  \n",
    "| a<span style=\"border-left:solid 1px;margin:0px 5px\"></span>b | 字符a或字符b  \n",
    "| [afg] | a或者f或者g的一个字符\n",
    "| [0-4] | 0-4范围内的一个字符\n",
    "| [a-f] | a-f范围内的一个字符\n",
    "| [^m]  | 不是m的一个字符\n",
    "| \\w    | 匹配字母数字，等价于 `[A-Za-z0-9_]`\n",
    "| \\W    | 匹配非字母数字，等价于 `[^A-Za-z0-9_]`\n",
    "| \\s    | 匹配任意空白字符，等价于 `[\\t\\n\\r\\f]`\n",
    "| \\S    | 匹配任意非空字符，等价于 `[^ \\f\\n\\r\\t\\v]`\n",
    "| \\d    | 匹配任意数字，等价于 `[0-9]`\n",
    "| \\D    | 匹配任意非数字，等价于 `[^0-9]`\n",
    "\n",
    "> 因为markdown表格中`a|b`的`|`会转义，所以这里用html代替"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 边界匹配\n",
    "\n",
    "|||\n",
    "| :---: | :---: |\n",
    "| \\A    | 匹配字符串开始\n",
    "| \\Z    | 匹配字符串结束，如果是存在换行，只匹配到换行前的结束字符串\n",
    "| \\z    | 匹配字符串结束"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 重复\n",
    "紧跟在单个字符之后，表示多个这样类似的字符\n",
    "\n",
    "| | |\n",
    "| :---: | :---: |\n",
    "| *      | 重复 >=0 次\n",
    "| +      | 重复 >=1 次\n",
    "| ?      | 重复 0或者1 次\n",
    "| {m}    | 重复m次。比如说 a{4}相当于aaaa，再比如说[1-3]{2}相当于[1-3][1-3]\n",
    "| {m,n} | 重复m到n次，小于m次的重复，或者大于n次的重复都不符合条件，中间不能有空格"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 贪婪与非贪婪\n",
    "正则匹配默认是贪婪匹配\n",
    "\n",
    "贪婪：尽可能多的匹配"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(5, 21), match='wo22three34four4'>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search(r'w.*4', 'one1two22three34four4')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 5), match='11111'>"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search(r'1{2,5}', '11111111')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "非贪婪：尽可能少的匹配"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(5, 16), match='wo22three34'>"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search(r'w.*?4', 'one1two22three34four4')  # 比.*就多了个？号"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 位置\n",
    "\n",
    "| | |\n",
    "| :---: | :---: |\n",
    "| ^ | 字符串的起始位置\n",
    "| $ | 字符串的结尾位置"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 其他\n",
    "\n",
    "| | |\n",
    "| :---: | :---: |\n",
    "| (re) | 匹配括号内的表达式，也表示一个组\n",
    "| `\\n` | 匹配一个换行符\n",
    "| `\\t` | 匹配一个制表符\n",
    "\n",
    "其他转义字符等等"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 前向否定界定符"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(1, 5), match='fwfw'>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 1), match='h'>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 8), match='https://'>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search(r'(?!a).*', 'afwfw')\n",
    "re.search(r'(?!www)\\S', 'https://www.baidu.com/')\n",
    "re.search(r'((?!www)\\S)+', 'https://www.baidu.com/')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 正则中的(?P<name>…)\n",
    "\n",
    "和普通的用括号分组类似，不同在于这里给匹配的组命名，后续（同一正则表达式内和搜索后得到的Match对象中），都可以通过此group的名字而去引用此group。\n",
    "虽然此处group内命名了，但是其仍然和普通的中一样，可以通过索引号，即group(1),group(2)等等，去引用对应的group的。\n",
    "\n",
    "    group(1)==group(name1)  \n",
    "    group(2)==group(name2)  \n",
    "    "
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4rc1"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {
    "height": "839px",
    "left": "0px",
    "right": "1587px",
    "top": "107px",
    "width": "333px"
   },
   "toc_section_display": "block",
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": "15",
    "lenType": "10",
    "lenVar": "60"
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
